00010 June 26 1973
00020
00030 A Proposal for Speech Understanding Research
00040
00050
00060 It is proposed that the work on speech recognition that is
00070 now under way in the A.I. project at Stanford University be continued
00080 and extended as a separate project with broadened aims in the field
00090 of speech understanding. This work gives considerable promise both of
00100 solving some of the immediate problems that beset speech
00110 understanding research and of providing a basis for future advances.
00120
00130 It is further proposed that this work be more closely tied to
00140 the ARPA Speech Understanding Research effort than it has been in the
00150 past and that it have as its express aim the study and application to
00160 speech recognition of a machine learning process that has proved
00170 highly successful in another application and that has already been
00180 tested out to a limited extent in speech recognition. The machine
00190 learning process offers both an automatic training scheme and the
00200 inherent ability of the system to adapt to various speakers and
00210 dialects. Speech recognition via machine learning represents a global
00220 approach to the speech recognition problem and can be incorporated
00230 into a wide class of limited vocabulary systems.
00240
00250 Finally, we would propose accepting responsibility for keeping
00260 other ARPA projects supplied with operating versions of the best
00270 current programs that we have developed. The availability of the high
00280 quality front end that the signature table approach provides would
00290 enable designers of the various over-all systems
00300 to test the relative performance of the top-down portions of their
00310 systems without having to make allowances for the deficiencies
00320 of their currently available front ends. Indeed, if the signature table
00330 scheme can be made simple enough to compete on a time basis (and we
00340 believe that it can), then it may replace the other front end
00350 schemes that are currently in favor.
00360
00370 Stanford University is well suited as the site for such work,
00380 having both the facilities for this work and a staff of people with
00390 experience and interest in machine learning, phonetic analysis, and
00400 digital signal processing. The staff at present consists of the
00410 proposed Principal Investigator, Arthur L. Samuel; one post-doctoral
00420 staff member, Ravindra Thosar, who
00430 has worked on speech recognition and synthesis in India; a
00440 second member, Dr. Neil Miller, who has had considerable signal-processing
00450 experience; and a few graduate students. It is anticipated that a staff
00460 of not more than 3 full-time members, with the help of 3 or 4 graduate
00470 students, could mount a meaningful program, which should be funded for a
00480 minimum of two years to ensure continuity of effort.
00490 We would expect to demonstrate the utility of the
00500 signature table approach within this time span and to provide a working
00510 system that could be used as the front end for any of the
00520 Speech Understanding Systems that are currently under
00530 development or are being planned.
00550
00560 Ultimately we would
00570 like to have a system capable of understanding speech from an
00580 unlimited domain of discourse and with an unknown speaker. It seems not
00590 unreasonable to expect the system to deal with this situation very
00600 much as people do when they adapt their understanding processes to
00610 the speaker's idiosyncrasies during the conversation. The signature table
00620 method gives promise of contributing toward the solution of this
00630 problem as well as being a
00640 possible answer to some of the more immediate problems.
00650
00660 The initial thrust of the proposed work would be toward the
00670 development of adaptive learning techniques, using the signature
00680 table method and some more recent variants and extensions of this
00690 basic procedure. We have already demonstrated the usefulness of this
00700 method for the initial assignment of significant features to the
00710 acoustic signals. One of the next steps will be to extend the method
00720 to include acoustic-phonetic probabilities in the decision process.
00730
00740 Still another aspect to be studied would be the amount of
00750 preprocessing that should be done and the desired balance between
00760 bottom-up and top-down approaches. It is fairly obvious that
00770 decisions of this sort should ideally be made dynamically, depending
00780 upon the familiarity of the system with the domain of
00790 discourse and with the characteristics of the speaker.
00800 Compromises will undoubtedly have to be made in any immediately
00810 realizable system but we should understand better than we now do the
00820 limitations on the system that such compromises impose.
00830
00840 It may be well at this point to describe the general
00850 philosophy that has been followed in the work that is currently under
00860 way and the results that have been achieved to date. We have been
00870 studying elements of a speech recognition system that is not
00880 dependent upon the use of a limited vocabulary and that can recognize
00890 continuous speech by a number of different speakers.
00900
00910 Such a system should be able to function successfully either
00920 without any previous training for the specific speaker in question or
00930 after a short training session in which the speaker would be asked to
00940 repeat certain phrases designed to train the system on those phonetic
00950 utterances that seemed to depart from the previously learned norm. In
00960 either case it is believed that some automatic or semi-automatic
00970 training system should be employed to acquire the data that is used
00980 for the identification of the phonetic information in the speech. We
00990 believe that this can best be done by employing a modification of the
01000 signature table scheme previously described. A brief review of this
01010 earlier form of signature table is given in Appendix 1.
01020
01030 The over-all system is envisioned as one in which the input
01040 speech is first separated, in the more or less conventional way, into
01050 short time slices for which some sort of frequency analysis
01060 (homomorphic, LPC, or the like) is done. We then interpret this
01070 information in terms of significant features by means of a set of
01080 signature tables. At this point we define longer sections of the
01090 speech, called segments, which are obtained by grouping together varying
01100 numbers of the original slices on the basis of their similarity. This
01110 then takes the place of other forms of initial segmentation. Having
01120 identified a series of segments in this way, we next use another set of
01130 signature tables to extract information from the sequence of segments
01140 and combine it with a limited amount of syntactic and semantic
01150 information to define a sequence of phonemes.
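
To make the grouping step concrete, the following sketch (written in
Python purely for illustration; the feature vectors, the Euclidean
distance measure, and the threshold are assumptions made for the example
and are not part of the proposed system) shows one way in which adjacent
slices might be merged into segments on the basis of their similarity.

    # Illustrative sketch: group adjacent time slices into segments by similarity.
    # The feature vectors, the distance measure, and the threshold are assumptions
    # made for this example; they are not taken from the proposal.
    def group_slices(slices, threshold=1.5):
        """slices: one feature vector (list of floats) per time slice.
        Returns a list of segments, each a list of consecutive slice indices."""
        def distance(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
        segments = []
        current = [0]
        for i in range(1, len(slices)):
            if distance(slices[i], slices[i - 1]) <= threshold:
                current.append(i)         # similar enough: extend the segment
            else:
                segments.append(current)  # dissimilar: close it, start a new one
                current = [i]
        segments.append(current)
        return segments

    # Three similar slices, a jump, then two more similar slices:
    print(group_slices([[1.0, 2.0], [1.1, 2.1], [1.0, 1.9], [5.0, 6.0], [5.1, 6.1]]))
    # -> [[0, 1, 2], [3, 4]]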
01160
01170 While it would be possible to extend this bottom-up approach
01180 still further, it seems reasonable to break off at this point and
01190 revert to a top-down approach from here on. The real difference in
01200 the over-all system would then be that the top-down analysis would
01210 deal with the outputs from the signature table section as its
01220 primitives rather than with the outputs from the initial measurements
01230 either in the time domain or in the frequency domain. In the case of
01240 inconsistencies the system could either refer to the second choices
01250 retained within the signature tables or if need be could always go
01260 clear back to the input parameters. The decision as to how far to
01270 carry the initial bottom-up analysis must depend upon the relative
01280 cost of this analysis, both in complexity and in processing time, and
01290 the certainty with which it can be performed, as compared with the
01300 corresponding costs and certainty for the rest of the analysis,
01310 taking due notice of the cost in time of
01320 recovering from false starts.
01330
01340 Signature tables can be used to perform four essential
01350 functions that are required in the automatic recognition of speech.
01360 These functions are: (1) the elimination of superfluous and
01370 redundant information from the acoustic input stream, (2) the
01380 transformation of the remaining information from one coordinate
01390 system to a more phonetically meaningful coordinate system, (3) the
01400 mixing of acoustically derived data with syntactic, semantic and
01410 linguistic information to obtain the desired recognition, and (4) the
01420 introduction of a learning mechanism.
01430
01440 The following three advantages emerge from this method of
01450 training and evaluation.
01460 1) Essentially arbitrary inter-relationships between the
01470 input terms are taken into account by any one table. The only loss of
01480 accuracy is in the quantization.
01490 2) The training is a very simple process of accumulating
01500 counts. The training samples are introduced sequentially, and hence
01510 simultaneous storage of all the samples is not required.
01520 3) The process linearizes the storage requirements in the
01530 parameter space, as illustrated in the sketch below.
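
As a rough illustration of points (2) and (3), the following sketch
(Python, for illustration only; the table sizes and the uniform
five-level quantization are assumptions chosen for the example) shows
training as simple count accumulation over a stream of samples, and
compares the storage needed by a hierarchy of small tables with that of
a single table over all parameters at once.

    # Training is just count accumulation; samples arrive one at a time and
    # need not be stored.  (The table here is a plain dictionary for brevity.)
    def train(table, sample, is_positive):
        """table maps an input vector (a tuple) to [agree, disagree] counts."""
        counts = table.setdefault(sample, [0, 0])
        counts[0 if is_positive else 1] += 1

    table = {}
    train(table, (3, 1, 4), True)
    train(table, (3, 1, 4), False)
    print(table)                       # {(3, 1, 4): [1, 1]}

    # Storage comparison for 27 parameters, each (for this example) quantized
    # to 5 levels: one joint table versus a 9 + 3 + 1 hierarchy of 3-input tables.
    joint_entries   = 5 ** 27          # every combination of all 27 parameters
    hierarchy_total = 13 * 5 ** 3      # thirteen tables of 125 entries each
    print(joint_entries, hierarchy_total)   # about 7.5e18 versus 1625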
01540
01550 The signature tables, as used in speech recognition, must be
01560 particularized to allow for the multi-category nature of the output.
01570 Several forms of tables have been investigated. Details of the current
01580 system are given in Appendix 2. For some early results, see
01590 SUR Note 43, "Some Preliminary Experiments in Speech Recognition
01600 Using Signature Tables," by R. B. Thosar and A. L. Samuel.
01620
01630 Work is currently under way on a major refinement of the
01640 signature table approach which adopts a somewhat more rigorous
01650 procedure. Preliminary results with this scheme indicate that a
01660 substantial improvement has been achieved. This effort is described in
01670 a recent report, SUR Note 81, "Estimation of Probability Density Using
01680 Signature Tables for Application to Pattern Recognition," by
01690 R. B. Thosar.
01700
01720 We are currently involved in work on a segmentation
01730 procedure which has already demonstrated its ability to compete with other
01740 proposed segmentation systems, even when used to process speech from
01750 speakers whose utterances were not used during the training
01760 sequence.
00010 RESEARCH GRANT BUDGET
00020
00030 TWO YEARS BEGINNING OCTOBER 1, 1973
00040
00050
00060 BUDGET CATEGORY YEAR 1 YEAR 2
00070 -----------------------------------------------------------------
00080 I. SALARIES & WAGES:
00090
00100 Samuel, A.,
00110 Senior Research Associate
00120 Principal Investigator, 75% 20,000 20,000
00130
00140 ------,
00150 Research Associate 14,520 14,520
00160
00170 Miller, N.,
00180 Research Associate 13,680 13,680
00190
00200 ------,
00210 Student Research Assistant,
00220 50% academic year, 100% summer 4,914 5,070
00230
00240 ------,
00250 Student Research Assistant,
00260 50% academic year, 100% summer 4,914 5,070
00270
00280 Reserve for Salary Increases
00290 @ 5.5% per year 2,901 5,980
00300 ------- -------
00310
00320 TOTAL SALARIES AND WAGES $60,929 $64,320
00330
00340 II. STAFF BENEFITS:
00350
00360 17.0% 10-1-73 to 8-31-74 9,495
00370 18.3% 9-1-74 to 8-31-75 929 10,790
00380 19.3% 9-1-75 to 9-30-75 1,034
00390 ------- -------
00400 TOTAL STAFF BENEFITS $10,424 $11,824
00410
00420 III. TRAVEL:
00430
00440 Domestic -
00450 Local 150
00460 East Coast 450
00470 ---
00480 $600 $600
00490
00500 IV. EXPENDABLE MATERIALS & SERVICES:
00510
00520 A. Telephone Service 480
00530 B. Office Supplies 600
00540 ---
00550 $1,080 $1,080
00560
00570 V. PUBLICATIONS COST:
00580
00590 2 Papers @ 500 ea. $1,000 $1,000
00600 ------- -------
00610
00620 VI. TOTAL DIRECT COSTS:
00630
00640 (Items I through V) $74,033 $78,824
00650
00660 VII. INDIRECT COSTS:
00670
00680 On Campus - 47% of NTDC $34,796 $37,047
00690
00700 ------- -------
00710 VIII. TOTAL COSTS:
00720
00730 (Items VI + VII) $108,829 $115,871
00740 -------- --------
00750 -------- --------
00010
00020
00030
00040 COGNIZANT PERSONNEL
00050
00060
00070 For contractual matters:
00080
00090 Office of the Research Administrator
00100 Stanford University
00110 Stanford, California 94305
00120
00130 Telephone: (415) 321-2300, ext. 2883
00140
00150 For technical and scientific matters regarding this proposal:
00160
00170 Prof. John McCarthy
00180 Computer Science Department
00190 Stanford University
00200 Stanford, California 94305
00210
00220 Telephone: (415) 321-2300, ext. 4971
00230
00240 For administrative matters, including questions relating
00250 to the budget or property acquisition:
00260
00270 Mr. Lester D. Earnest
00280 Computer Science Department
00290 Stanford University
00300 Stanford, California 94305
00310
00320 Telephone: (415) 321-2300, ext. 4971
00330
00010 FACILITIES
00020
00030 The computer facilities of the Stanford Artificial Intelligence
00040 Laboratory include the following equipment.
00050
00060 Central Processors: Digital Equipment Corporation PDP-10 and PDP-6
00070
00080 Primary Store: 65K words of 1.7 microsecond DEC Core
00090 65K words of 1 microsecond Ampex Core
00100 131K words of 1.6 microsecond Ampex Core
00110
00120 Swapping Store: Librascope disk (5 million words, 22 million
00130 bits/second transfer rate)
00140
00150 File Store: IBM 3330 disc file, 6 spindles (leased)
00160
00170 Peripherals: 4 DECtape drives, 2 mag tape drives, line printer,
00180 Calcomp plotter, Xerox Graphics Printer
00190
00200 Communications
00210 Processor: BBN IMP (Honeywell DDP-516) connected to the
00220 ARPA network.
00230
00240 Terminals: 58 TV displays, 6 III displays, 3 IMLAC displays,
00250 1 ARDS display, 15 Teletype terminals
00260
00270 Special Equipment: Audio input and output systems, hand-eye
00280 equipment (2 TV cameras, 3 arms), remote-
00290 controlled cart
01770
00010
00020 Appendix 1
00030
00040 The early form of a signature table
00050
00060 For those not familiar with signature tables as they were
00070 used by Samuel in programs which played the game of checkers, the
00080 concept is best illustrated (Fig.1) by an arrangement of tables used
00090 in the program. There are 27 input terms. Each term evaluates a
00100 specific aspect of a board situation and is quantized into a
00110 limited but adequate range of values (7, 5, and 3 in this case). The
00120 terms are divided into 9 sets with 3 terms each, forming the 9 first
00130 level tables. Outputs from the first level tables are quantized to 5
00140 levels and combined into 3 second level tables and, finally, into one
00150 third-level table whose output represents the figure of merit of the
00160 board in question.
00170
00180 A signature table has an entry for every possible combination
00190 of its input values. Thus there are 7*5*3 or 105 entries in each of
00200 the first level tables. Training consists of accumulating two counts
00210 for each entry during a training sequence. Count A is incremented
00220 when the current input vector represents a preferred move and count D
00230 is incremented when it is not the preferred move. The output from the
00240 table is computed as a correlation coefficient
00250 C = (A - D)/(A + D).
00260 The figure of merit for a board is simply the
00270 coefficient obtained as the output from the final table.
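
The mechanism just described can be summarized in a few lines of code.
The sketch below (Python, for illustration only; the class and method
names are invented, and only the 7 x 5 x 3 quantization, the A and D
counts, and the formula C = (A - D)/(A + D) come from the description
above) shows one first-level table with its 105 entries.

    # Illustrative sketch of one first-level checkers signature table.
    class SignatureTable:
        def __init__(self, ranges=(7, 5, 3)):
            self.ranges = ranges
            size = 1
            for r in ranges:
                size *= r                      # 7 * 5 * 3 = 105 entries
            self.counts = [[0, 0] for _ in range(size)]   # [A, D] per entry

        def index(self, terms):
            # map a quantized input vector, e.g. (6, 2, 0), onto a single entry
            i = 0
            for value, r in zip(terms, self.ranges):
                i = i * r + value
            return i

        def train(self, terms, preferred):
            self.counts[self.index(terms)][0 if preferred else 1] += 1

        def output(self, terms):
            a, d = self.counts[self.index(terms)]
            return 0.0 if a + d == 0 else (a - d) / (a + d)   # C = (A-D)/(A+D)

    t = SignatureTable()
    t.train((6, 2, 0), preferred=True)
    t.train((6, 2, 0), preferred=False)
    t.train((6, 2, 0), preferred=True)
    print(t.output((6, 2, 0)))         # (2 - 1)/(2 + 1) = 0.333...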
00010
00020 Appendix 2
00030
00040 Initial Form of Signature Table for Speech Recognition
00050
00060 The signature tables, as used in speech recognition, must be
00070 particularized to allow for the multi-category nature of the output.
00080 Several forms of tables have been investigated. The initial form,
00090 tested and used for the data presented in the attached paper, employs
00100 tables consisting of two parts: a preamble and the table proper. The
00110 preamble contains: (1) space for saving a record of the current and
00120 recent output reports from the table, (2) identifying information as
00130 to the specific type of table, (3) a parameter that identifies the
00140 desired output from the table and that is used in the learning
00150 process, (4) a gating parameter specifying the input that is to be
00160 used to gate the table, (5) the sign of the gate,
00170 (6) the gating level to be used and (7)
00180 parameters that identify the sources of the normal inputs to the
00190 table.
00200
00210 All inputs are limited in range and specify either the
00220 absolute level of some basic property or, more usually, the probability
00230 of some property being present. These inputs may be from the original
00240 acoustic input or they may be the outputs of other tables. If from
00250 other tables they may be for the current time step or for earlier
00260 time steps (subject to practical limits as to the number of time
00270 steps that are saved).
00280
00290 The output, or outputs, from each table are similarly limited
00300 in range and specify, in all cases, a probability that some
00310 particular significant feature, phonette, phoneme, word segment, word
00320 or phrase is present.
00330
00340 We are limiting the range of inputs and outputs to values
00350 specified by 3 bits and the number of entries per table to 64,
00360 although this choice of values is a matter to be determined by
00370 experiment. We are also providing for any of the following input
00380 combinations: (1) one input of 6 bits, (2) two inputs of 3 bits each,
00390 (3) three inputs of 2 bits each, and (4) six inputs of 1 bit each.
00400 The uses to which these different forms are put will be described
00410 later.
00420
00430 The body of each table contains entries corresponding to
00440 every possible combination of the allowed input parameters. Each
00450 entry in the table actually consists of several parts. There are
00460 fields assigned to accumulate counts of the occurrences of incidents
00470 in which the specifying input values coincided with the different
00480 desired outputs from the table, as found during previous learning
00490 sessions, and there are fields containing the summarized results of
00500 these learning sessions, which are used as outputs from the table.
00510 The outputs from the tables can then express to the allowed accuracy
00520 all possible functions of the input parameters.
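
The structure described above can be pictured with the following sketch
(Python, for illustration only; the field names, the example sources,
and the gating defaults are assumptions, while the preamble contents,
the 3-bit ranges, the 64-entry body, and the four input combinations
follow the text).

    # Illustrative sketch of one speech signature table: a preamble plus a
    # 64-entry body, with the inputs packed into a 6-bit entry index.
    class SpeechTable:
        def __init__(self, table_type, desired_output, input_sources,
                     bits_per_input, gate_source=None, gate_sign=+1, gate_level=0):
            # --- preamble ---
            self.recent_outputs = []               # current and recent output reports
            self.table_type     = table_type       # identifies the kind of table
            self.desired_output = desired_output   # used during learning
            self.gate_source    = gate_source      # input used to gate the table
            self.gate_sign      = gate_sign        # sign of the gate
            self.gate_level     = gate_level       # gating level
            self.input_sources  = input_sources    # sources of the normal inputs
            self.bits_per_input = bits_per_input   # 6, 3, 2 or 1 bit(s) per input
            # --- table proper: 64 entries, each with learning counts and an output ---
            self.entries = [{"counts": [0, 0], "output": 0} for _ in range(64)]

        def index(self, inputs):
            """Pack the inputs (1x6, 2x3, 3x2 or 6x1 bits) into a 6-bit index."""
            i = 0
            for value in inputs:
                i = (i << self.bits_per_input) | value
            return i & 0x3F

    # e.g. a feature table fed by two 3-bit outputs from lower-level tables:
    voicing = SpeechTable("feature", "voiced", ("T1", "T2"), bits_per_input=3)
    print(voicing.index((5, 3)))       # (5 << 3) | 3 = 43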
00530
00540 Operation in the Training Mode
00550
00560 When operating in the training mode the program is supplied
00570 with a sequence of stored utterances with accompanying phonetic
00580 transcriptions. Each sample of the incoming speech signal is
00590 analysed (Fourier transforms or inverse filter equivalent) to obtain
00600 the necessary input parameters for the lowest level tables in the
00610 signature table hierarchy. At the same time reference is made to a
00620 table of phonetic "hints" which prescribes, for every possible
00630 phonemic input, the desired output from each signature table. The
00640 signature tables are then processed.
00650
00660 The processing of each table is done in two steps, the first
00670 performed each time the table is referenced and the second only
00680 periodically. The first step consists of locating a single entry line
00690 within the table, as specified by the inputs to the table, and adding
00700 a 1 to the appropriate field to indicate the presence of the property
00710 specified by the hint table as corresponding to the phoneme in the
00720 phonemic transcription. At this time a report is also made as to the
00730 table's output as determined from the averaged results of previous
00740 learning so that a running record may be kept of the performance of
00750 the system. At periodic intervals all tables are updated to
00760 incorporate recent learning results. To make this process easily
00770 understandable, let us restrict our attention to a table used to
00780 identify a single significant feature, say voicing. The hint table
00790 will identify whether or not the phoneme currently being processed is
00800 to be considered voiced. If it is voiced, a 1 is added to the "yes"
00810 field of the entry line located by the normal inputs to the table. If
00820 it is not voiced, a 1 is added to the "no" field. At updating time
00830 the output that this entry will subsequently report is determined by
00840 dividing the accumulated sum in the "yes" field by the sum of the
00850 numbers in the "yes" and the "no" fields, and reporting this quantity
00860 as a number in the range from 0 to 7. Actually the process is a bit
00870 more complicated than this and it varies with the exact type of table
00880 under consideration, as reported in detail in appendix B. Outputs
00890 from the signature tables are not probabilities, in the strict sense,
00900 but are the statistically-arrived-at odds based on the actual
00910 learning sequence.
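
The two-step procedure for a single-feature table can be sketched as
follows (Python, for illustration only; the class, the entry index used
in the example, and the use of simple rounding are assumptions, while
the yes/no counts and the 0-to-7 output range follow the description
above).

    # Illustrative sketch of training a single-feature ("voicing") table.
    class VoicingTable:
        def __init__(self, n_entries=64):
            self.yes = [0] * n_entries
            self.no  = [0] * n_entries
            self.out = [0] * n_entries     # reported output, in the range 0..7

        def accumulate(self, entry, hint_is_voiced):
            """Step 1: on every reference, add 1 to the field named by the hint
            for the entry line selected by the normal inputs."""
            if hint_is_voiced:
                self.yes[entry] += 1
            else:
                self.no[entry] += 1
            return self.out[entry]         # report current output for the record

        def update(self):
            """Step 2: periodically, recompute each entry's output as
            yes / (yes + no), expressed as a number from 0 to 7."""
            for i in range(len(self.out)):
                total = self.yes[i] + self.no[i]
                if total:
                    self.out[i] = round(7 * self.yes[i] / total)

    t = VoicingTable()
    for voiced in (True, True, False):
        t.accumulate(43, voiced)
    t.update()
    print(t.out[43])                       # round(7 * 2/3) = 5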
00920
00930 The preamble of the table has space for storing twelve past
00940 outputs. An input to a table can thus be delayed by up to twelve time
00950 steps. Such a table relates the outcomes of previous events to the present
00960 hint, the learning input. A certain amount of context-dependent learning is thus
00970 possible with the limitation that the specified delays are constant.
00980
00990 The interconnected hierarchy of tables forms a network which
01000 runs incrementally, in steps synchronous with the time window over which
01010 the input signal is analysed. The present window width is set at 12.8
01020 ms (256 points at 20 K samples/sec) with an overlap of 6.4 ms. Inputs
01030 to this network are the parameters abstracted from the frequency
01040 analyses of the signal, and the specified hint. The outputs of the
01050 network could be either the probability attached to every phonetic
01060 symbol or the output of a table associated with a feature such as
01070 voiced, vowel, etc. The point to be made is that the output generated
01080 for a sample is essentially independent of its contiguous
01090 samples. The dependency achieved by using delays in the inputs is
01100 invisible to the outputs. The outputs thus report the best estimate of
01110 what the current acoustic input is, with no relation to the past
01120 outputs. Relating the successive outputs along the time dimension is
01130 realised by counters.
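
The window arithmetic quoted above can be checked in a few lines
(Python, for illustration only; the frame-slicing function is an assumed
detail, but the 20 K sample rate, 256-point window, and 6.4 ms overlap
are as stated).

    # 256 samples at 20,000 samples/sec give a 12.8 ms window; a 6.4 ms
    # overlap means the window advances by 128 samples each step.
    SAMPLE_RATE = 20000                    # samples per second
    WINDOW      = 256                      # samples per analysis window
    HOP         = WINDOW // 2              # half-window advance -> 6.4 ms overlap

    print(1000.0 * WINDOW / SAMPLE_RATE)   # 12.8 (ms per window)
    print(1000.0 * HOP / SAMPLE_RATE)      # 6.4  (ms advance and ms of overlap)

    def frames(signal):
        """Yield successive overlapping analysis windows of the input signal."""
        for start in range(0, len(signal) - WINDOW + 1, HOP):
            yield signal[start:start + WINDOW]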
01140
01150 The Use of COUNTERS
01160
01170 The transition from initial sample space to segment space is
01180 made possible by means of COUNTERS which are summed and reinitiated
01190 whenever their inputs cross specified threshold values, being
01200 triggered on when the input exceeds the threshold and off when it
01210 falls below. Momentary spikes are eliminated by specifying time
01220 hysteresis, the number of consecutive samples for which the input
01230 must be above the threshold. The output of a counter provides
01240 information about starting time, duration and average input for the
01250 period it was active.
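
The behaviour of a counter can be sketched as follows (Python, for
illustration only; the threshold and hysteresis values and the buffered
implementation are assumptions, while the triggering rule, the rejection
of momentary spikes, and the reported quantities follow the text).

    # Illustrative sketch of a COUNTER: it reports (start time, duration,
    # average input) for each period during which its input stays above the
    # threshold, discarding runs shorter than the time hysteresis.
    class Counter:
        def __init__(self, threshold=4, hysteresis=2):
            self.threshold, self.hysteresis = threshold, hysteresis
            self.run = []                  # consecutive above-threshold (time, value)
            self.reports = []              # (start_time, duration, average_input)

        def step(self, t, value):
            if value > self.threshold:
                self.run.append((t, value))
            else:
                # input fell below threshold: keep the run if it was long
                # enough, otherwise discard it as a momentary spike
                if len(self.run) >= self.hysteresis:
                    start = self.run[0][0]
                    values = [v for _, v in self.run]
                    self.reports.append((start, len(values), sum(values) / len(values)))
                self.run = []

    c = Counter()
    for t, v in enumerate([1, 6, 7, 7, 6, 1, 6, 1]):   # the lone 6 at t=6 is a spike
        c.step(t, v)
    print(c.reports)                       # [(1, 4, 6.5)]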
01260
01270 Since a counter can reference a table at any level in the
01280 hierarchy of tables, it can reflect any desired degree of information
01290 reduction. For example, a counter may be set up to show a section of
01300 speech to be a vowel, a front vowel, or the vowel /I/. The counters can
01310 be looked upon as representing a mapping of parameter-time space into
01320 feature-time space or, at a higher level, symbol-time space. It may be
01330 useful to carry along the feature information as a backup in those
01340 situations where the symbolic information is not acceptable for
01350 syntactic or semantic interpretation.
01360
01370 In the same manner as the tables, the counters run completely
01380 independently of each other. In a recognition run the counters may
01390 overlap in arbitrary fashion, may leave gaps where no counter has
01400 been triggered, or may not line up nicely. A properly segmented output,
01410 where the consecutive sections are in time sequence and are neatly
01420 labelled, is essential for further processing. This is achieved by
01430 registering the instants when the counters are triggered or
01440 terminated to form time slices called segments.
01450
01460 An event is the period between successive activations or
01470 terminations of any counter. An event shorter than a specified time is
01480 merely ignored. A record of each event's duration and of up to three active
01490 counters, ordered according to their probability, is maintained.
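
The formation of events from counter activity might be sketched as
follows (Python, for illustration only; the interval representation, the
labels, the probabilities, and the minimum duration are all invented for
the example).

    # Illustrative sketch: every activation or termination instant of any
    # counter is an event boundary; short events are ignored, and each event
    # records its duration and up to three active counters ordered by probability.
    def events(intervals, min_duration=2):
        """intervals: list of (start, end, label, probability) counter periods."""
        boundaries = sorted({t for s, e, _, _ in intervals for t in (s, e)})
        result = []
        for start, end in zip(boundaries, boundaries[1:]):
            if end - start < min_duration:
                continue                           # too short: merely ignored
            active = [(lab, p) for s, e, lab, p in intervals
                      if s <= start and end <= e]
            active.sort(key=lambda lp: -lp[1])     # order by probability
            result.append((start, end - start, active[:3]))
        return result

    print(events([(0, 10, "vowel", 0.9), (0, 6, "front vowel", 0.7), (4, 10, "/I/", 0.6)]))
    # [(0, 4, [('vowel', 0.9), ('front vowel', 0.7)]),
    #  (4, 2, [('vowel', 0.9), ('front vowel', 0.7), ('/I/', 0.6)]),
    #  (6, 4, [('vowel', 0.9), ('/I/', 0.6)])]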
01500
01510 An event resulting from the processing described so far
01520 represents a phonette, one of the basic speech categories defined as
01530 hints in the learning process. It is only an estimate of closeness to
01540 a speech category, based on past learning. Also, each category has a
01550 more or less stationary spectral characterisation. Thus a category may
01560 have a phonemic equivalent, as in the case of vowels; it may be
01570 common to a phoneme class, as for the voiced or unvoiced stop gaps; or it
01580 may be subphonemic, as a T-burst or a K-burst. The choices are based on
01590 acoustic expediency, i.e. optimisation of the learning rather than on
01600 any linguistic considerations. However, higher-level interpretive
01610 programs may best operate on inputs resembling a phonemic
01620 transcription. The contiguous segments may be coalesced into phoneme-like
01630 units using dyadic or triadic probabilities and acoustic-phonetic
01640 rules particular to the system. For example, a period of silence
01650 followed by a type of burst or a short friction may be combined to
01660 form the corresponding stop. A short friction or a burst following a
01670 nasal or a lateral may be called a stop even if the silence period is
01680 short or absent. Clearly these rules must be specific to the system,
01690 based on the confidence with which durations and phonette categories
01700 are recognised.
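
The flavour of such acoustic-phonetic rules can be conveyed by the
following sketch (Python, for illustration only; the label names, the
particular stop /T/ chosen, and the rule ordering are assumptions, and
only the two example rules come from the text above).

    # Illustrative sketch of rule-based coalescing of contiguous segments into
    # phoneme-like units, using the two example rules given above.
    def coalesce(segments):
        """segments: list of phonette labels in time order."""
        out, i = [], 0
        while i < len(segments):
            a = segments[i]
            b = segments[i + 1] if i + 1 < len(segments) else None
            if a == "silence" and b in ("T-burst", "short-friction"):
                out.append("T")            # silence + burst/friction -> a stop
                i += 2
            elif a in ("nasal", "lateral") and b in ("T-burst", "short-friction"):
                out.append(a)              # keep the nasal or lateral ...
                out.append("T")            # ... and call the burst a stop anyway
                i += 2
            else:
                out.append(a)
                i += 1
        return out

    print(coalesce(["vowel", "silence", "T-burst", "vowel", "nasal", "short-friction"]))
    # ['vowel', 'T', 'vowel', 'nasal', 'T']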
01710